Abstract. We solve the problems of detecting and counting various forms of regularities in a string represented as a Straight Line Program (SLP). Given an SLP of size $n$ that represents a string $s$ of length $N$, our algorithm computes all runs and squares in $s$ in $O(n^3 h)$ time and $O(n^2)$ space, where $h$ is the height of the derivation tree of the SLP. We also show an algorithm to compute all gapped palindromes in $O(n^3 h + gnh \log N)$ time and $O(n^2)$ space, where $g$ is the length of the gap. The key technique of the above solution also allows us to compute the periods and covers of the string in $O(n^2 h)$ time and $O(nh(n + \log^2 N))$ time, respectively.

1 Introduction

Finding regularities such as squares, runs, and palindromes in strings is a fundamental and important problem in stringology with various applications, and many efficient algorithms have been proposed (e.g., [12, 6, 1, 7, 13, 2, 9]). See also [5] for a survey.

In this paper, we consider the problem of detecting regularities in a string $s$ of length $N$ that is given in a compressed form, namely, as a straight line program (SLP), which is essentially a context-free grammar in Chomsky normal form that derives only $s$. Our model of computation is the word RAM: we assume that the computer word size is at least $\lceil \log_2 N \rceil$, and hence standard operations on values representing lengths and positions of the string $s$ can be performed in constant time. Space complexities are measured in the number of computer words (not bits).

Given an SLP whose size is $n$ and whose derivation tree has height $h$, Bannai et al. [3] showed how to test whether the string $s$ is square-free or not in $O(n^3 h \log N)$ time and $O(n^2)$ space. Independently, Khvorost [8] presented an algorithm for computing a compact representation of all squares in $s$ in $O(n^3 h \log N)$ time and $O(n^2)$ space. Matsubara et al. [14] showed that a compact representation of all maximal palindromes occurring in the string $s$ can be computed in $O(n^3 h)$ time and $O(n^2)$ space. Note that the length $N$ of the decompressed string $s$ can be as large as $O(2^n)$ in the worst case. Therefore, in such cases these algorithms are more efficient than any algorithm that works on uncompressed strings.

In this paper we present the following extensions and improvements to the above work, namely,

1. an $O(n^3 h)$-time $O(n^2)$-space algorithm for computing a compact representation of squares and runs;

2. an $O(n^3 h + gnh \log N)$-time $O(n^2)$-space algorithm for computing a compact representation of palindromes with a gap (spacer) of length $g$.

We remark that our algorithms can easily be extended to count the number of squares, runs, and gapped palindromes within the same time and space complexities.

Note that Result 1 improves on the work by Khvorost [8], which requires $O(n^3 h \log N)$ time and $O(n^2)$ space. The key to the improvement is our new technique of Section 3.3 called approximate doubling, which we believe is of independent interest. In fact, using the approximate doubling technique, one can improve the time complexity of the algorithms of Lifshits [10] to compute the periods and covers of a string given as an SLP, to $O(n^2 h)$ time and $O(nh(n + \log^2 N))$ time, respectively.

If we allow no gaps in palindromes (i.e., if we set $g = 0$), then Result 2 implies that we can compute a compact representation of all maximal palindromes in $O(n^3 h)$ time and $O(n^2)$ space. Hence, Result 2 can be seen as a generalization of the work by Matsubara et al. [14] with the same efficiency.

2 Preliminaries 

2.1 Strings 

Let $\Sigma$ be the alphabet; an element of $\Sigma^*$ is called a string. For a string $s = xyz$, $x$ is called a prefix, $y$ is called a substring, and $z$ is called a suffix of $s$, respectively. The length of string $s$ is denoted by $|s|$. The empty string $\varepsilon$ is a string of length 0, that is, $|\varepsilon| = 0$. For $1 \le i \le |s|$, $s[i]$ denotes the $i$-th character of $s$. For $1 \le i \le j \le |s|$, $s[i..j]$ denotes the substring of $s$ that begins at position $i$ and ends at position $j$. For any string $s$, let $s^R$ denote the reversed string of $s$, that is, $s^R = s[|s|] \cdots s[2]s[1]$. For any strings $s$ and $u$, let $lcp(s,u)$ (resp. $lcs(s,u)$) denote the length of the longest common prefix (resp. suffix) of $s$ and $u$.

We say that string $s$ has a period $c$ ($0 < c \le |s|$) if $s[i] = s[i + c]$ for any $1 \le i \le |s| - c$. For a period $c$ of $s$, we denote $s = u^q$, where $u$ is the prefix of $s$ of length $c$ and $q = |s|/c$. For convenience, let $u^0 = \varepsilon$. If $q \ge 2$, $s = u^q$ is called a repetition with root $u$ and period $|u|$. Also, we say that $s$ is primitive if there is no string $u$ and integer $k > 1$ such that $s = u^k$. If $s$ is primitive, then $s^2$ is called a square.
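As a concrete illustration of these definitions, here is a minimal Python sketch on uncompressed strings. The function names `has_period`, `is_primitive` and `is_square` are illustrative choices rather than notation from the paper, and indices inside the code are 0-based.

```python
def has_period(s: str, c: int) -> bool:
    """True if c is a period of s, i.e. s[i] == s[i + c] wherever both exist."""
    return 0 < c <= len(s) and all(s[i] == s[i + c] for i in range(len(s) - c))


def is_primitive(s: str) -> bool:
    """True if s cannot be written as u^k for a string u and an integer k > 1."""
    n = len(s)
    return n > 0 and not any(n % c == 0 and has_period(s, c) for c in range(1, n))


def is_square(s: str) -> bool:
    """True if s = w w for a primitive string w (the definition used in the text)."""
    n = len(s)
    half = n // 2
    return n > 0 and n % 2 == 0 and s[:half] == s[half:] and is_primitive(s[:half])
```

For example, `"abab"` is a square with root `"ab"`, whereas `"aaaa"` is not a square under this definition because its root `"aa"` is not primitive.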

We denote a repetition in a string $s$ by a triple $(b, e, c)$ such that $s[b..e]$ is a repetition with period $c$. A repetition $(b, e, c)$ in $s$ is called a run (or maximal periodicity in [11]) if $c$ is the smallest period of $s[b..e]$ and the substring cannot be extended to the left nor to the right with the same period, namely neither $s[b-1..e]$ nor $s[b..e+1]$ has period $c$. Note that for any run $(b, e, c)$ in $s$, every substring of length $2c$ in $s[b..e]$ is a square. Let $Run(s)$ denote the set of all runs in $s$.
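To make the definition of $Run(s)$ tangible, the sketch below enumerates runs of an uncompressed string by brute force. It is only a reference implementation for small inputs (roughly cubic time), not the compressed algorithm of this paper; the helper names are assumptions introduced for illustration. Triples are reported with 1-based positions, as in the text.

```python
def smallest_period(s: str) -> int:
    """Smallest c >= 1 such that s[i] == s[i + c] for all valid i."""
    for c in range(1, len(s) + 1):
        if all(s[i] == s[i + c] for i in range(len(s) - c)):
            return c
    return 0  # only reached for the empty string


def runs(s: str):
    """All runs (b, e, c) of s, found by scanning maximal stretches of each period c."""
    n = len(s)
    found = set()
    for c in range(1, n // 2 + 1):
        i = 0
        while i + c < n:
            if s[i] != s[i + c]:
                i += 1
                continue
            j = i
            while j + c < n and s[j] == s[j + c]:
                j += 1
            b, e = i, j + c - 1  # maximal stretch s[b..e] with period c (0-based)
            if e - b + 1 >= 2 * c and smallest_period(s[b:e + 1]) == c:
                found.add((b + 1, e + 1, c))
            i = j + 1
    return sorted(found)
```

For instance, `runs("aabaab")` reports the two runs `aa` with period 1 and the whole string with period 3.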

A string $s$ is said to be a palindrome if $s = s^R$. A string $s$ is said to be a gapped palindrome if $s = xux^R$ for some string $u \in \Sigma^*$. Note that $u$ may or may not be a palindrome. The prefix $x$ (resp. suffix $x^R$) of $xux^R$ is called the left arm (resp. right arm) of the gapped palindrome $xux^R$. If $|u| = g$, then $xux^R$ is said to be a $g$-gapped palindrome. We denote a maximal $g$-gapped palindrome in a string $s$ by a pair $(b, e)_g$ such that $s[b..e]$ is a $g$-gapped palindrome and $s[b-1..e+1]$ is not. Let $gPals(s)$ denote the set of all maximal $g$-gapped palindromes in $s$.

Given a text string $s \in \Sigma^+$ and a pattern string $p \in \Sigma^+$, we say that $p$ occurs at position $i$ ($1 \le i \le |s| - |p| + 1$) iff $s[i..i+|p|-1] = p$. Let $Occ(s,p)$ denote the set of positions where $p$ occurs in $s$. For a pair of integers $1 \le b \le e$, $[b,e] = \{b, b+1, \ldots, e\}$ is called an interval.

Lemma 1 ([15]). For any strings $s, p \in \Sigma^+$ and any interval $[b, e]$ with $1 \le b \le e \le b + |p|$, $Occ(s,p) \cap [b,e]$ forms a single arithmetic progression if $Occ(s,p) \cap [b,e] \ne \emptyset$.
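The statement of Lemma 1 can be checked experimentally on small uncompressed strings. The following sketch lists the occurrences of $p$ that start inside a window $[b, b + |p|]$ and reports them as a triple (first element, common difference, count). All function names are hypothetical, and the naive scan is for illustration only.

```python
def occurrences(s: str, p: str):
    """1-based starting positions of p in s (naive scan)."""
    return [i + 1 for i in range(len(s) - len(p) + 1) if s.startswith(p, i)]


def occurrences_in_window(s: str, p: str, b: int):
    """Occurrences of p starting inside the interval [b, b + |p|]."""
    e = b + len(p)
    return [i for i in occurrences(s, p) if b <= i <= e]


def as_progression(positions):
    """(first, difference, count) if positions form an arithmetic progression, else None."""
    if not positions:
        return None
    if len(positions) == 1:
        return (positions[0], 0, 1)
    d = positions[1] - positions[0]
    if all(y - x == d for x, y in zip(positions, positions[1:])):
        return (positions[0], d, len(positions))
    return None
```

This is why one cell of an AP-table needs only a constant number of machine words: the three numbers of the triple describe the whole set.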



2.2 Straight-line programs 

A straight-line program (SLP) $\mathcal{S}$ of size $n$ is a set of productions $\mathcal{S} = \{X_i \to expr_i\}_{i=1}^{n}$, where each $X_i$ is a distinct variable and each $expr_i$ is either $expr_i = X_\ell X_r$ ($1 \le \ell, r < i$), or $expr_i = a$ for some $a \in \Sigma$. Note that $X_n$ derives only a single string and, therefore, we view the SLP as a compressed representation of the string $s$ that is derived from the variable $X_n$. Recall that the length $N$ of the string $s$ can be as large as $O(2^n)$. However, it is always the case that $n \ge \log N$. For any variable $X_i$, let $val(X_i)$ denote the string that is derived from variable $X_i$. Therefore, $val(X_n) = s$. When it is not confusing, we identify $X_i$ with the string represented by $X_i$.
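A small worked example may help fix the notation. The sketch below encodes a toy SLP as a Python dictionary (this encoding and the names `SLP`, `lengths` and `val` are assumptions made for illustration) and computes $|val(X_i)|$ bottom-up without expanding the strings.

```python
# Hypothetical toy SLP: each variable maps to a character or to a pair
# (left, right) of variables with smaller indices.
SLP = {1: "a", 2: "b", 3: (1, 2), 4: (3, 1), 5: (4, 3), 6: (5, 4)}


def lengths(slp):
    """|val(X_i)| for every variable, computed in increasing order of i."""
    length = {}
    for i in sorted(slp):
        rhs = slp[i]
        length[i] = 1 if isinstance(rhs, str) else length[rhs[0]] + length[rhs[1]]
    return length


def val(slp, i):
    """Explicit expansion val(X_i); exponential in the worst case, for testing only."""
    rhs = slp[i]
    return rhs if isinstance(rhs, str) else val(slp, rhs[0]) + val(slp, rhs[1])
```

Here six productions derive a string of length 8; continuing the same Fibonacci-like pattern makes the derived length grow exponentially in the number of productions.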

Let $T_i$ denote the derivation tree of a variable $X_i$ of an SLP $\mathcal{S}$. The derivation tree of $\mathcal{S}$ is $T_n$ (see also Fig. 5 in Appendix C). Let $height(X_i)$ denote the height of the derivation tree $T_i$ of $X_i$, and let $height(\mathcal{S}) = height(X_n)$. We associate each leaf of $T_i$ with the corresponding position of the string $val(X_i)$. For any node $z$ of the derivation tree $T_i$, let $\ell_z$ be the number of leaves to the left of $z$ in $T_i$. The position of $z$ in $T_i$ is $\ell_z + 1$.

Let $[u, v]$ be any integer interval with $1 \le u \le v \le |val(X_i)|$. We say that the interval $[u, v]$ crosses the boundary of node $z$ in $T_i$ if the lowest common ancestor of the leaves $u$ and $v$ in $T_i$ is $z$. We also say that the interval $[u, v]$ touches the boundary of node $z$ in $T_i$ if either $[u-1, v]$ or $[u, v+1]$ crosses the boundary of $z$ in $T_i$. Assume $p = val(X_i)[u..u+|p|-1]$ and the interval $[u, u+|p|-1]$ crosses or touches the boundary of node $z$ in $T_i$. When $z$ is labeled by $X_j$, we also say that the occurrence of $p$ starting at position $u$ in $val(X_i)$ crosses or touches the boundary of $X_j$.



Lemma 2 ([4]). Given an SLP $\mathcal{S}$ of size $n$ describing string $w$ of length $N$, we can pre-process $\mathcal{S}$ in $O(n)$ time and space to answer the following queries in $O(\log N)$ time:

- Given a position $u$ with $1 \le u \le N$, answer the character $w[u]$.

- Given an interval $[u, v]$ with $1 \le u \le v \le N$, answer the node $z$ the interval $[u, v]$ crosses, the label $X_i$ of $z$, and the position of $z$ in $T_{\mathcal{S}} = T_n$.
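The two queries of Lemma 2 can be answered naively by walking down the derivation tree, which costs time proportional to the height of the tree rather than the $O(\log N)$ of the cited data structure. The sketch below shows only this simple walk; the dictionary encoding of the SLP and all function names are assumptions made for illustration.

```python
# Hypothetical toy SLP: variable -> character, or (left, right) pair of earlier variables.
SLP = {1: "a", 2: "b", 3: (1, 2), 4: (3, 1), 5: (4, 3), 6: (5, 4)}  # derives "abaababa"


def lengths(slp):
    """|val(X_i)| for every variable, computed bottom-up."""
    length = {}
    for i in sorted(slp):
        rhs = slp[i]
        length[i] = 1 if isinstance(rhs, str) else length[rhs[0]] + length[rhs[1]]
    return length


def char_at(slp, length, i, u):
    """Character val(X_i)[u] (1-based), found by descending from the root of T_i."""
    while not isinstance(slp[i], str):
        left, right = slp[i]
        if u <= length[left]:
            i = left
        else:
            u -= length[left]
            i = right
    return slp[i]


def crossing_node(slp, length, i, u, v):
    """(label, position) of the node of T_i whose boundary [u, v] crosses; needs u < v."""
    offset = 0  # number of leaves to the left of the current node
    while True:
        left, right = slp[i]
        m = length[left]
        if v <= m:
            i = left
        elif u > m:
            i, u, v, offset = right, u - m, v - m, offset + m
        else:
            return i, offset + 1
```

In the toy SLP, the interval $[4, 5]$ crosses the boundary of a node labeled $X_3$ at position 4, while $[5, 6]$ crosses the boundary of the root.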

For any production $X_i \to X_\ell X_r$ and a string $p$, let $Occ^{\triangle}(X_i, p)$ be the set of occurrences of $p$ which begin in $X_\ell$ and end in $X_r$. Let $\mathcal{S}$ and $\mathcal{T}$ be SLPs of sizes $n$ and $m$, respectively. Let the AP-table for $\mathcal{S}$ and $\mathcal{T}$ be an $n \times m$ table such that for any pair of variables $X \in \mathcal{S}$ and $Y \in \mathcal{T}$ the table stores $Occ^{\triangle}(X, Y)$. It follows from Lemma 1 that $Occ^{\triangle}(X, Y)$ forms a single arithmetic progression which requires $O(1)$ space, and hence the AP-table can be represented in $O(nm)$ space.

Lemma 3 ([10]). Given two SLPs $\mathcal{S}$ and $\mathcal{T}$ of sizes $n$ and $m$, respectively, the AP-table for $\mathcal{S}$ and $\mathcal{T}$ can be computed in $O(nmh)$ time and $O(nm)$ space, where $h = height(\mathcal{S})$.

Lemma 4 ([10], local search (LS)). Using the AP-table for $\mathcal{S}$ and $\mathcal{T}$ that describe strings $s$ and $p$, respectively, we can compute, given any position $b$ and constant $\alpha > 0$, $Occ(s,p) \cap [b, b + \alpha|p|]$ in the form of at most $\lceil \alpha \rceil$ arithmetic progressions in $O(h)$ time, where $h = height(\mathcal{S})$.

Note that, given any $1 \le i \le j \le |s|$, we are able to build an SLP of size $O(n)$ that generates the substring $s[i..j]$ in $O(n)$ time. Hence, by computing the AP-table for $\mathcal{S}$ and the new SLP, we can conduct the local search (LS) operation on the substring $s[i..j]$ in $O(n^2 h)$ time.

For any variable $X_i$ of $\mathcal{S}$ and positions $1 \le k_1, k_2 \le |X_i|$, we define the "right-right" longest common extension query by

$$LCE(X_i, k_1, k_2) = lcp(X_i[k_1..|X_i|], X_i[k_2..|X_i|]).$$

Using a technique of [13] in conjunction with Lemma 3, it is possible to answer the query in $O(n^2 h)$ time for each pair of positions, with no pre-processing. We will later show our new algorithm which, after $O(n^2 h)$-time pre-processing, answers the LCE query for any pair of positions in $O(h \log N)$ time.

3 Finding runs 

In this section we propose an $O(n^3 h)$-time and $O(n^2)$-space algorithm to compute an $O(n \log N)$-size representation of all runs in a text $s$ of length $N$ represented by an SLP $\mathcal{S} = \{X_i \to expr_i\}_{i=1}^{n}$ of height $h$.

For each production $X_i \to X_{\ell(i)} X_{r(i)}$ with $i \le n$, we consider the set $Run^{\triangle}(X_i)$ of runs which touch or cross the boundary of $X_i$ and are completed in $X_i$, i.e., those that are neither prefixes nor suffixes of $X_i$. Formally,

$$Run^{\triangle}(X_i) = \{(b, e, c) \in Run(X_i) \mid 1 \le b - 1 \le |X_{\ell(i)}| < e + 1 \le |X_i|\}.$$



It is known that for any interval $[b, e]$ with $1 \le b < e \le |s|$, there exists a unique occurrence of a variable $X_i$ in the derivation tree of the SLP such that the interval $[b, e]$ crosses the boundary of $X_i$. Also, wherever $X_i$ appears in the derivation tree, the runs in $Run^{\triangle}(X_i)$ occur in $s$ with some appropriate offset, and these occurrences of the runs are never contained in $Run^{\triangle}(X_j)$ for any other variable $X_j$ with $j \ne i$. Hence, by computing $Run^{\triangle}(X_i)$ for all variables $X_i$ with $i \le n$, we can essentially compute all runs of $s$ that are neither prefixes nor suffixes of $s$. In order to detect prefix/suffix runs of $s$, it is sufficient to consider two auxiliary variables $X_{n+1} \to X_{\$} X_n$ and $X_{n+2} \to X_{n+1} X_{\$'}$, where $X_{\$}$ and $X_{\$'}$ respectively derive special characters $\$$ and $\$'$ that are not in $s$ and $\$ \ne \$'$. Hence, the problem of computing the runs from an SLP $\mathcal{S}$ reduces to computing $Run^{\triangle}(X_i)$ for all variables $X_i$ with $i \le n + 2$.

Our algorithm is based on the divide-and-conquer method used in [3] and also [8], which detect squares crossing the boundary of each variable $X_i$. Roughly speaking, in order to detect such squares we take some substrings of $val(X_i)$ as seeds, each of which is in charge of distinct squares, and for each seed we detect squares by using LS and LCE a constant number of times. There is a difference between [3] and [8] in how the seeds are taken, and ours is based on that in [3]. In the next subsection, we briefly describe our basic algorithm which runs in $O(n^3 h \log N)$ time.

3.1 Basic algorithm 

Consider runs in $Run^{\triangle}(X_i)$ with $X_i \to X_\ell X_r$. Since a run in $Run^{\triangle}(X_i)$ contains a square which touches or crosses the boundary of $X_i$, our algorithm finds a run by first finding such a square, and then computing the maximal extension of its period to the left and right of its occurrence.

We classify each square $ww$ by its length and how it relates to the boundary of $X_i$. When $|w| > 1$, there exists $1 \le t < \log |val(X_i)|$ such that $2^t \le |w| < 2^{t+1}$, and there are four cases (see also Fig. 1): (1) $|w_\ell| \ge \frac{3}{2}|w|$, (2) $\frac{3}{2}|w| > |w_\ell| \ge |w|$, (3) $|w| > |w_\ell| \ge \frac{1}{2}|w|$, (4) $\frac{1}{2}|w| > |w_\ell|$, where $w_\ell$ is a prefix of $ww$ which is also a suffix of $val(X_\ell)$.

The point is that in any case we can take a substring $p$ of length $2^{t-1}$ of $s$ which touches the boundary of $X_i$ and is completely contained in $w$. By using $p$ as a seed we can detect runs by the following steps:

Step 1: Conduct local search of $p$ in an "appropriate range" of $X_i$, and find a copy $p'$ ($= p$) of $p$.
Step 2: Compute the length $plen$ of the longest common prefix to the right of $p$ and $p'$, and the length $slen$ of the longest common suffix to the left of $p$ and $p'$, then check that $plen + slen \ge d - |p|$, where $d$ is the distance between the beginning positions of $p$ and $p'$.
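Step 2 can be illustrated on an uncompressed string. In the sketch below, the seed $p$ and its copy $p'$ are given by a 0-based start index `i`, the distance `d` and the seed length `m`; the naive `lcp` and `lcs` stand in for the LCE queries of the compressed setting. The function name `repetition_from_seed` and its signature are assumptions for illustration, and the reported period `d` is not checked for minimality.

```python
def lcp(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def lcs(a: str, b: str) -> int:
    """Length of the longest common suffix of a and b."""
    return lcp(a[::-1], b[::-1])


def repetition_from_seed(s: str, i: int, d: int, m: int):
    """Maximal repetition (b, e, d), 1-based, implied by seed s[i:i+m] and its copy at i+d."""
    p, copy = s[i:i + m], s[i + d:i + d + m]
    if len(copy) < m or p != copy:
        return None
    plen = lcp(s[i + m:], s[i + d + m:])  # extension to the right of p and p'
    slen = lcs(s[:i], s[:i + d])          # extension to the left of p and p'
    if plen + slen < d - m:               # extended region shorter than 2d
        return None
    return (i - slen + 1, i + d + m + plen, d)
```

The test `plen + slen >= d - m` is exactly the condition of Step 2: the extended region has length `slen + d + m + plen`, which must be at least `2 * d` for a repetition with period `d` to exist.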

Notice that Step 2 actually computes the maximal extension of the repetition.

Since $d = |w|$, it is sufficient to conduct local search in the range satisfying $2^t \le d < 2^{t+1}$, namely, the width of the interval for local search is smaller than $2|p|$, and all occurrences of $p'$ are represented by at most two arithmetic progressions. Although exponentially many runs can be represented by an arithmetic progression, its periodicity enables us to efficiently detect all of them by using LCE only a constant number of times, and they are encoded in $O(1)$ space. We put the details in Appendix A since the employed techniques are essentially the same as in [8].

Fig. 1. The left arrows represent the longest common suffix between the substrings immediately to the left of $p$ and $p'$. The right arrows represent the longest common prefix between the substrings immediately to the right of $p$ and $p'$.

By varying $t$ from 1 to $\log N$, we can obtain an $O(\log N)$-size compact representation of $Run^{\triangle}(X_i)$ in $O(n^2 h \log N)$ time. More precisely, we get a list of $O(\log N)$ quintuplets $\langle \delta_1, \delta_2, \delta_3, c, k \rangle$ such that the union of the sets $\bigcup_{j=0}^{k} \{(\delta_1 - cj, \delta_2 + cj, \delta_3 + cj)\}$ over all elements of the list equals $Run^{\triangle}(X_i)$ without duplicates. By applying the above procedure to all the $n$ variables, we can obtain an $O(n \log N)$-size compact representation of all runs in $s$ in $O(n^3 h \log N)$ time. The total space requirement is $O(n^2)$, since we need $O(n^2)$ space at each step of the algorithm.
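The quintuplet encoding can be unpacked mechanically. The sketch below expands a list of quintuplets into explicit run triples and also counts the encoded runs without expanding them; the function names and the sample values in the usage note are made up for illustration.

```python
def expand(quintuplets):
    """Explicit triples (b, e, c) encoded by quintuplets (d1, d2, d3, c, k)."""
    triples = []
    for d1, d2, d3, c, k in quintuplets:
        for j in range(k + 1):
            triples.append((d1 - c * j, d2 + c * j, d3 + c * j))
    return triples


def count_encoded(quintuplets):
    """Number of encoded runs, computed in time linear in the list length."""
    return sum(k + 1 for _, _, _, _, k in quintuplets)
```

For example, the single quintuplet `(10, 20, 5, 3, 2)` stands for three triples whose periods are 5, 8 and 11. Counting needs only the field `k`, which is one reason the algorithms extend to counting at no extra cost.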

In order to improve the running time of the algorithm to $O(n^3 h)$, we will use the new techniques of the following two subsections.



3.2 Longest common extension 



In this subsection we propose a more efficient algorithm for LCE queries.

Lemma 5. We can pre-process an SLP $\mathcal{S}$ of size $n$ and height $h$ in $O(n^2 h)$ time and $O(n^2)$ space, so that given any variable $X_i$ and positions $1 \le k_1, k_2 \le |X_i|$, $LCE(X_i, k_1, k_2)$ is answered in $O(h \log N)$ time.



To compute $LCE(X_i, k_1, k_2)$ we will use the following function: For an SLP $\mathcal{S} = \{X_i \to expr_i\}_{i=1}^{n}$, let $Match$ be a function such that

$$Match(X_i, X_j, k) = \begin{cases} true & \text{if } k \in Occ(X_i, X_j), \\ false & \text{if } k \notin Occ(X_i, X_j). \end{cases}$$



Lemma 6. We can pre-process a given SLP $\mathcal{S}$ of size $n$ and height $h$ in $O(n^2 h)$ time and $O(n^2)$ space so that the query $Match(X_i, X_j, k)$ is answered in $O(\log N)$ time.

Proof. We apply Lemma 2 to every variable $X_i$ of $\mathcal{S}$, so that the queries of Lemma 2 are answered in $O(\log N)$ time on the derivation tree $T_i$ of each variable $X_i$ of $\mathcal{S}$. Since there are $n$ variables in $\mathcal{S}$, this takes a total of $O(n^2)$ time and space. We also apply Lemma 3 with both SLPs equal to $\mathcal{S}$, which takes $O(n^2 h)$ time and $O(n^2)$ space. Hence the pre-processing takes a total of $O(n^2 h)$ time and $O(n^2)$ space.

To answer the query $Match(X_i, X_j, k)$, we first find the node of $T_i$ that the interval $[k, k + |X_j| - 1]$ crosses, its label $X_q$, and its position $r$ in $T_i$. This takes $O(\log N)$ time using Lemma 2. Then we check in $O(1)$ time whether $(k - r) \in Occ^{\triangle}(X_q, X_j)$ or not, using the arithmetic progression stored in the AP-table. Thus the query is answered in $O(\log N)$ time. □

The following function will also be used in our algorithm: Let $FirstMismatch$ be a function such that

$$FirstMismatch(X_i, X_j, k) = \begin{cases} lcp(X_i[k..|X_i|], X_j) & \text{if } |X_i| - k + 1 \ge |X_j|, \\ \text{undefined} & \text{otherwise.} \end{cases}$$



Using Lemma 6 we can establish the following lemma. See Appendix B for a full proof.

Lemma 7. We can pre-process a given SLP $\mathcal{S}$ of size $n$ and height $h$ in $O(n^2 h)$ time and $O(n^2)$ space so that the query $FirstMismatch(X_i, X_j, k)$ is answered in $O(h \log N)$ time.

We are ready to prove Lemma 5.

Proof. Consider computing $LCE(X_i, k_1, k_2)$. Without loss of generality, assume $k_1 \ge k_2$. Let $z$ be the lca of the $k_2$-th and $(k_2 - k_1 + |X_i|)$-th leaves of the derivation tree $T_i$. Let $P_\ell$ be the path from $z$ to the $k_2$-th leaf of the derivation tree $T_i$, and let $L$ be the list of the right children of the nodes in $P_\ell$, sorted in increasing order of their position in $T_i$. The number of nodes in $L$ is at most $height(X_i) \le h$, and $L$ can be computed in $O(height(X_i)) = O(h)$ time. Let $P_r$ be the path from $z$ to the $(k_2 - k_1 + |X_i|)$-th leaf of the derivation tree $T_i$, and let $R$ be the list of the left children of the nodes in $P_r$, sorted in increasing order of their position in $T_i$. $R$ can be computed in $O(h)$ time as well. Let $U = L \cup R = \{X_{u(1)}, X_{u(2)}, \ldots, X_{u(m)}\}$ be the list obtained by concatenating $L$ and $R$, so that the variables of $U$ together spell out $X_i[k_2..k_2 - k_1 + |X_i|]$. For each $X_{u(p)}$ in increasing order of $p = 1, 2, \ldots, m$, we perform the query $Match(X_i, X_{u(p)}, k_1 + \sum_{q=1}^{p-1} |X_{u(q)}|)$ until either finding the first variable $X_{u(p')}$ for which the query returns false (see also Fig. 5 in Appendix C), or all the queries for $p = 1, \ldots, m$ have returned true. In the latter case, clearly $LCE(X_i, k_1, k_2) = |X_i| - k_1 + 1$. In the former case, the first mismatch occurs between $X_i$ and $X_{u(p')}$, and hence

$$LCE(X_i, k_1, k_2) = \sum_{q=1}^{p'-1} |X_{u(q)}| + FirstMismatch\Bigl(X_i, X_{u(p')}, k_1 + \sum_{q=1}^{p'-1} |X_{u(q)}|\Bigr).$$

Since $U$ contains at most $2 \cdot height(X_i)$ variables, we perform $O(h)$ Match queries. We perform at most one FirstMismatch query. Thus, using Lemmas 6 and 7, we can compute $LCE(X_i, k_1, k_2)$ in $O(h \log N)$ time after $O(n^2 h)$-time $O(n^2)$-space pre-processing. □
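The block-by-block structure of this proof can be sketched in Python. The code below splits one of the two suffixes into $O(h)$ variables of the SLP and compares them one block at a time. The Match and FirstMismatch oracles are replaced by naive comparisons on the expanded string, so this shows only the control flow and not the stated running time. The dictionary encoding and all function names are assumptions for illustration.

```python
# Hypothetical toy SLP: variable -> character, or (left, right) pair of earlier variables.
SLP = {1: "a", 2: "b", 3: (1, 2), 4: (3, 1), 5: (4, 3), 6: (5, 4)}  # derives "abaababa"


def lengths(slp):
    length = {}
    for i in sorted(slp):
        rhs = slp[i]
        length[i] = 1 if isinstance(rhs, str) else length[rhs[0]] + length[rhs[1]]
    return length


def val(slp, i):
    rhs = slp[i]
    return rhs if isinstance(rhs, str) else val(slp, rhs[0]) + val(slp, rhs[1])


def decompose(slp, length, x, a, b):
    """Variables whose concatenation is val(X_x)[a..b] (1-based, inclusive)."""
    if a == 1 and b == length[x]:
        return [x]
    left, right = slp[x]
    m = length[left]
    if b <= m:
        return decompose(slp, length, left, a, b)
    if a > m:
        return decompose(slp, length, right, a - m, b - m)
    return decompose(slp, length, left, a, m) + decompose(slp, length, right, 1, b - m)


def lce(slp, i, k1, k2):
    """lcp of val(X_i)[k1..] and val(X_i)[k2..], compared block by block."""
    length = lengths(slp)
    text = val(slp, i)  # naive stand-in for the Match / FirstMismatch oracles
    if k1 < k2:
        k1, k2 = k2, k1
    blocks = decompose(slp, length, i, k2, k2 + length[i] - k1)
    matched, pos = 0, k1
    for x in blocks:
        block = val(slp, x)
        window = text[pos - 1:pos - 1 + len(block)]
        if window != block:  # corresponds to Match returning false
            n = 0
            while window[n] == block[n]:  # corresponds to one FirstMismatch query
                n += 1
            return matched + n
        matched += len(block)
        pos += len(block)
    return matched
```

The list returned by `decompose` plays the role of $U$ in the proof: each block is tested once, and only the first failing block is examined character by character.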

We can also use Lemma 5 to compute "left-left", "left-right", and "right-left" longest common extensions on the uncompressed string $s = val(\mathcal{S})$: We can compute in $O(n)$ time an SLP $\mathcal{S}^R$ of size $n$ which represents the reversed string $s^R$ [14]. We then construct a new SLP $\mathcal{S}'$ of size $2n$ and height $h + 1$ by concatenating the last variables of $\mathcal{S}$ and $\mathcal{S}^R$, and apply Lemma 5 to $\mathcal{S}'$.



3.3 Approximate doubling 

Here we show how to reduce the number of AP-table computations required in Step 1 of the basic algorithm from $O(\log N)$ to $O(1)$ per variable.

Consider any production $X_i \to X_\ell X_r$. If we build a new SLP which contains variables that derive the prefixes of length $2^t$ of $X_r$ for each $0 \le t < \log |X_r|$, we can obtain the AP-tables for $X_i$ and all prefix seeds of $X_r$ by computing the AP-table for $X_i$ and the new SLP. Unfortunately, however, the size of such a new SLP can be as large as $O(n \log N)$. Here we notice that the lengths of the seeds do not have to be exactly doublings, i.e., the basic algorithm of Section 3.1 works fine as long as the following properties are fulfilled: (a) the ratio of the lengths for each pair of consecutive seeds is bounded by a constant; (b) the whole string is covered by the $O(\log N)$ seeds.* We show in the next lemma that we can build an approximate doubling SLP of size $O(n)$.

Lemma 8. Let $\mathcal{S} = \{X_i \to expr_i\}_{i=1}^{n}$ be an SLP that derives a string $s$. We can build in $O(n)$ time a new SLP $\mathcal{S}' = \{Y_i \to expr'_i\}_{i=1}^{n'}$ with $n' = O(n)$ and $height(\mathcal{S}') = O(height(\mathcal{S}))$, which derives $s$ and contains $O(\log N)$ variables $Y_{a_1}, Y_{a_2}, \ldots, Y_{a_k}$ satisfying the following conditions:

- For any $1 \le j \le k$, $Y_{a_j}$ derives a prefix of $s$, $|Y_{a_1}| = 1$ and $|Y_{a_k}| = |s|$.

- For any $1 \le j < k$, $|Y_{a_j}| < |Y_{a_{j+1}}| \le 2|Y_{a_j}|$.

Proof. First, we copy the productions of $\mathcal{S}$ into $\mathcal{S}'$. Next we add the productions needed for creating the prefix variables $Y_{a_1}, Y_{a_2}, \ldots, Y_{a_k}$ in increasing order. We consider separating the derivation tree $T_n$ of $X_n$ into segments by a sequence of nodes $v_1, v_2, \ldots, v_k$ such that the $i$-th segment enclosed by the path from $v_i$ to $v_{i+1}$ represents the suffix of $Y_{a_{i+1}}$ of length $|Y_{a_{i+1}}| - |Y_{a_i}|$, namely, $Y_{a_{i+1}} \to Y_{a_i} Y_{b_i}$, where $Y_{b_i}$ is a variable for the $i$-th segment. Each node $v_i$ is called an l-node (resp. r-node) if the node belongs to the left (resp. right) segment of the node.

* A minor modification is that we conduct local search for a seed $p$ at Step 1 with the range satisfying $2|p| \le d < 2|q|$, where $q$ is the next longer seed of $p$.

We start from $v_1$, which is the leftmost node that derives $s[1]$. Suppose we have built prefix variables up to $Y_{a_i}$ and are now creating $Y_{a_{i+1}}$. At this moment we are at $v_i$. We move up to the node $u_i$ such that $u_i$ is the deepest node on the path from the root to $v_i$ which contains position $2|Y_{a_i}|$, and move down from $u_i$ towards position $2|Y_{a_i}|$. The traversal ends when we meet a node $v_{i+1}$ which satisfies one of the following conditions: (1) the rightmost position of $v_{i+1}$ is $2|Y_{a_i}|$, (2) $v_{i+1}$ is labeled with $X_j$, and we have traversed another node labeled with $X_j$ before.

- If Condition (1) holds, $v_{i+1}$ is set to be an l-node. It is clear that the length of the $i$-th segment is exactly $|Y_{a_i}|$ and $|Y_{a_{i+1}}| = 2|Y_{a_i}|$.

- If Condition (1) does not hold but Condition (2) holds, $v_{i+1}$ is set to be an r-node. Since $v_{i+1}$ contains position $2|Y_{a_i}|$, the length of the $i$-th segment is less than $|Y_{a_i}|$ and $|Y_{a_{i+1}}| < 2|Y_{a_i}|$. We remark that since $X_j$ appears in $Y_{a_{i+1}}$, then $|Y_{a_{i+1}}| + |X_j| \le 2|Y_{a_{i+1}}|$, and therefore, we never move down $v_{i+1}$ for the segments to follow.

We iterate the above procedures until we obtain a prefix variable $Y_{a_{k-1}}$ that satisfies $|X_n| \le 2|Y_{a_{k-1}}|$. We let $u_k$ be the deepest node on the path from the root to $v_{k-1}$ which contains position $|s|$, and let $v_k$ be the right child of $u_k$. Since $2|Y_{a_i}| \le |Y_{a_{i+2}}|$ for any $1 \le i < k$ for which both variables are defined, $k = O(\log N)$ holds.

We note that the $i$-th segment can be represented by the concatenation of "inner" nodes attached to the path from $v_i$ to $v_{i+1}$, and hence, the number of new variables needed for representing the segment is bounded by the number of such nodes. Consider all the edges we have traversed in the derivation tree $T_n$ of $X_n$. Each edge contributes to at most one new variable for some segment (see also Fig. 7 in Appendix C). Since each variable $X_j$ is used a constant number of times for moving down due to Condition (2), the number of traversed edges, as well as $n'$, is $O(n)$. Also, it is easy to make the height of $Y_{b_i}$ be $O(height(\mathcal{S}))$ for any $1 \le i < k$. Thus $height(\mathcal{S}') = O(\log N + height(\mathcal{S})) = O(height(\mathcal{S}))$. □
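The bound $k = O(\log N)$ used in this proof can be spelled out as a short derivation. It is a sketch under the assumption, argued in the proof, that prefix lengths at least double every two steps, i.e. $2|Y_{a_i}| \le |Y_{a_{i+2}}|$ whenever both variables exist.

```latex
\begin{align*}
|Y_{a_1}| = 1,\quad 2\,|Y_{a_i}| \le |Y_{a_{i+2}}|
  &\;\Longrightarrow\; |Y_{a_{2j+1}}| \ge 2^{j} \qquad (2j + 1 \le k),\\
|Y_{a_{2j+1}}| \le |s| = N
  &\;\Longrightarrow\; j \le \log_2 N,\\
  &\;\Longrightarrow\; k \le 2\log_2 N + 2 = O(\log N).
\end{align*}
```

The last line uses the fact that the largest admissible $j$ is $\lfloor (k-1)/2 \rfloor$, so $k - 2 \le 2j \le 2\log_2 N$.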

3.4 Improved algorithm 

Using Lemmas 5 and 8, we get the following theorem.

Theorem 1. Given an SLP $\mathcal{S}$ of size $n$ and height $h$ that describes string $s$ of length $N$, an $O(n \log N)$-size compact representation of all runs in $s$ can be computed in $O(n^3 h)$ time and $O(n^2)$ working space.

Fig. 2. Three groups of $g$-gapped palindromes to be found in $X_i$ (cases (1), (2) and (3)).

Proof. Using Lemma 5, we first pre-process $\mathcal{S}$ in $O(n^2 h)$ time so that any "right-right" or "left-left" LCE query can be answered in $O(h \log N)$ time. For each variable $X_i \to X_\ell X_r$, using Lemma 8, we build temporary SLPs $\mathcal{T}$ and $\mathcal{T}'$ which have respectively approximately doubling suffix variables of $X_\ell$ and prefix variables of $X_r$, and compute two AP-tables for $\mathcal{S}$ and each of them in $O(n^2 h)$ time. For each of the $O(\log N)$ prefix/suffix variables, we use it as a seed and find all corresponding runs by using LS and LCE queries a constant number of times. Hence the time complexity is $O(n^2 h + n(n^2 h + (h + h \log N) \log N)) = O(n^3 h)$. The space requirement is $O(n^2)$, the same as the basic algorithm. □
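The simplification of the running time in this proof relies on the fact $n \ge \log N$ noted in Section 2.2. Expanding the terms step by step:

```latex
\begin{align*}
O\bigl(n^2 h + n\,(n^2 h + (h + h\log N)\log N)\bigr)
  &= O\bigl(n^2 h + n^3 h + n h \log N + n h \log^2 N\bigr)\\
  &= O\bigl(n^3 h + n h \log^2 N\bigr)\\
  &= O(n^3 h) \qquad \text{since } \log N \le n .
\end{align*}
```

The first term is the LCE pre-processing, the factor $n$ ranges over the variables, $n^2 h$ is the cost of the AP-tables per variable, and $(h + h \log N)\log N$ accounts for one LS and a constant number of LCE queries for each of the $O(\log N)$ seeds.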

4 Finding g-gapped palindromes

A similar strategy to finding runs on SLPs can be used for computing a compact representation of the set $gPals(s)$ of $g$-gapped palindromes from an SLP $\mathcal{S}$ that describes string $s$. As in the case of runs, we add two auxiliary variables $X_{n+1} \to X_{\$} X_n$ and $X_{n+2} \to X_{n+1} X_{\$'}$. For each production $X_i \to X_\ell X_r$ with $i \le n + 2$, we consider the set $gPals^{\triangle}(X_i)$ of $g$-gapped palindromes which touch or cross the boundary of $X_i$ and are completed in $X_i$, i.e., those that are neither prefixes nor suffixes of $X_i$. Formally,

$$gPals^{\triangle}(X_i) = \{(b, e)_g \in gPals(X_i) \mid 1 \le b - 1 \le |X_\ell| < e + 1 \le |X_i|\}.$$

Each $g$-gapped palindrome in $X_i$ belongs to one of three groups (see also Fig. 2): (1) its right arm crosses or touches with its right end the boundary of $X_i$, (2) its left arm crosses or touches with its left end the boundary of $X_i$, (3) the others.

For Case (3), for every $|X_\ell| - g + 1 \le j \le |X_\ell|$ we check if $lcp(X_i[1..j]^R, X_i[j+g+1..|X_i|]) > 0$ or not. From Lemma 5, this can be done in $O(gh \log N)$ time for any variable by using "left-right" LCE (excluding pre-processing time for LCE). Hence we can compute all such $g$-gapped palindromes for all productions in $O(n^2 h + gnh \log N)$ time, and clearly they can be stored in $O(ng)$ space.
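The check used for Case (3) can be illustrated on an uncompressed string. In the sketch below, `boundary` plays the role of $|X_\ell|$, the loop ranges over the same values of $j$ as in the text, and a naive character comparison replaces the "left-right" LCE query. The function names are assumptions for illustration; results are 1-based pairs $(b, e)$.

```python
def arm_length(s: str, j: int, g: int) -> int:
    """lcp of the reverse of s[1..j] and s[j+g+1..] (1-based j), i.e. the arm length."""
    left, right = s[:j][::-1], s[j + g:]
    n = 0
    while n < len(left) and n < len(right) and left[n] == right[n]:
        n += 1
    return n


def gapped_palindromes_near_boundary(s: str, boundary: int, g: int):
    """Maximal g-gapped palindromes whose left arm ends at some j in the Case (3) range."""
    found = []
    for j in range(max(boundary - g + 1, 1), boundary + 1):
        a = arm_length(s, j, g)
        if a > 0:
            found.append((j - a + 1, j + g + a))
    return found
```

Each value of $j$ costs one extension query, which is where the factor $g$ in the $O(gh \log N)$ bound per variable comes from.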

For Case (1), let $w_\ell$ be the prefix of the right arm which is also a suffix of $val(X_\ell)$. We take approximately doubling suffixes of $X_\ell$ as seeds. Let $p$ be the longest seed that is contained in $w_\ell$. We can find $g$-gapped palindromes by the following steps:

Step 1: Conduct local search of $p' = p^R$ in an "appropriate range" of $X_i$ and find it in the left arm of the palindrome.
Step 2: Compute the "right-left" LCE of $p'$ and $p$, then check that the gap can be $g$. The outward maximal extension can be obtained by computing "left-right" LCE queries on the occurrences of $p'$ and $p$.

As in the case of runs, for each seed, the length of the range where the local search is performed in Step 1 is only $O(|p|)$. Hence, the occurrences of $p'$ can be represented by a constant number of arithmetic progressions. Also, we can obtain an $O(1)$-space representation of $g$-gapped palindromes for each arithmetic progression representing overlapping occurrences of $p'$, by using a constant number of LCE queries. Therefore, by processing $O(\log N)$ seeds for every variable $X_i$, we can compute in $O(n^2 h + n(n^2 h + (h + h \log N) \log N)) = O(n^3 h)$ time an $O(n \log N)$-size representation of all $g$-gapped palindromes for Case (1) in $s$.

Symmetrically to Case (1), we can find all $g$-gapped palindromes for Case (2). Putting it all together, we get the following theorem.

Theorem 2. Given an SLP of size $n$ and height $h$ that describes string $s$ of length $N$, and a non-negative integer $g$, an $O(n \log N + ng)$-size compact representation of all $g$-gapped palindromes in $s$ can be computed in $O(n^3 h + gnh \log N)$ time and $O(n^2)$ working space.



5 Discussions 

Let $R$ and $G$ denote the output compact representations of the runs and $g$-gapped palindromes of a given SLP $\mathcal{S}$, respectively, and let $|R|$ and $|G|$ denote their sizes. Here we show an application of $R$ and $G$: given any interval $[b, e]$ in $s$, we can count the number of runs and gapped palindromes in $s[b..e]$ in $O(n + |R|)$ and $O(n + |G|)$ time, respectively. We will describe only the case of runs, but a similar technique can be applied to gapped palindromes. As is described in Section 3.2, $s[b..e]$ can be represented by a sequence $U = (X_{u(1)}, X_{u(2)}, \ldots, X_{u(m)})$ of $O(h)$ variables of $\mathcal{S}$. Let $\mathcal{T}$ be the SLP obtained by concatenating the variables of $U$. There are three different types of runs in $R$: (1) runs that are completely within the subtree rooted at one of the nodes of $U$; (2) runs that begin and end inside $[b, e]$ and cross or touch any border between consecutive nodes of $U$; (3) runs that begin and/or end outside $[b, e]$. Observe that the runs of types (2) and (3) cross or touch the boundary of one of the nodes in the path from the root to the $b$-th leaf of the derivation tree $T_{\mathcal{S}}$, or in the path from the root to the $e$-th leaf of $T_{\mathcal{S}}$. A run that begins outside $[b, e]$ is counted only if the suffix of the run that intersects $[b, e]$ has an exponent of at least 2. The symmetric variant applies to a run that ends outside $[b, e]$. Thus, the number of runs of types (2) and (3) can be counted in $O(n + 2|R|)$ time. Since we can compute in a total of $O(n)$ time the number of nodes in the derivation tree of $\mathcal{T}$ that are labeled by $X_i$ for all variables $X_i$, the number of runs of type (1) for all variables $X_{u(j)}$ can be counted in $O(n + |R|)$ time. Noticing that runs are a compact representation of squares, we can also count the number of occurrences of all squares in $s[b..e]$ in $O(n + |R|)$ time by simple arithmetic operations.
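The "simple arithmetic operations" for squares can be made explicit for a single run. Since every substring of length $2c$ inside a run $(b, e, c)$ is a square, the number of such squares follows directly from the run length. The sketch below also clips a run to a query interval, which mirrors the exponent-at-least-2 condition above. The function names are assumptions for illustration, and only squares whose period equals the period $c$ of the run are counted.

```python
def squares_in_run(b: int, e: int, c: int) -> int:
    """Number of length-2c substrings (squares of period c) inside the run (b, e, c)."""
    return (e - b + 1) - 2 * c + 1


def squares_in_run_within(b: int, e: int, c: int, lo: int, hi: int) -> int:
    """Same count, restricted to squares lying entirely inside the interval [lo, hi]."""
    b2, e2 = max(b, lo), min(e, hi)
    return max(0, (e2 - b2 + 1) - 2 * c + 1)
```

A clipped run contributes nothing as soon as its intersection with the interval is shorter than $2c$, that is, when the intersecting part has exponent below 2.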

The approximate doubling and LCE algorithms of Section 3 can be used as a basis for other efficient algorithms on SLPs. For example, using approximate doubling, we can reduce the number of pairs of variables for which the AP-table has to be computed in the algorithms of Lifshits [10], which compute compact representations of all periods and covers of a string given as an SLP. As a result, we improve the time complexities from $O(n^2 h \log N)$ to $O(n^2 h)$ for periods, and from $O(n^2 h \log^2 N)$ to $O(nh(n + \log^2 N))$ for covers.
Appendix A: Details of the algorithm to find runs

In this section, we describe how we process occurrences of $p'$ at Step 2 of the basic algorithm. To handle occurrences of $p'$ that are represented by an arithmetic progression, we make use of its periodicity.

For any string $s$ and positive integer $c \le |s|$, let $\overrightarrow{rep}_c(s)$ (resp. $\overleftarrow{rep}_c(s)$) denote the length of the longest prefix (resp. suffix) of $s$ having period $c$.

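Lemma 9. Let $s, p \in \Sigma^+$ and $\{a_0, a_1, \ldots, a_k\}$ be consecutive occurrences of $p$ in $s$ that form a single arithmetic progression with common difference $c \le |p|$. Let $z_j = s[a_j + |p|..|s|]$ and $z'_j = s[1..a_j - 1]$ for any $0 \le j \le k$. For any non-empty strings $x, x' \in \Sigma^+$, it holds that

$$lcp(z_j, x) = \begin{cases} \min\{\overrightarrow{\alpha} - cj, \overrightarrow{\beta}\} & \text{if } \overrightarrow{\alpha} - cj \ne \overrightarrow{\beta}, \\ \overrightarrow{\beta} + lcp(z_j[\overrightarrow{\beta} + 1..|z_j|], x[\overrightarrow{\beta} + 1..|x|]) & \text{otherwise, and} \end{cases}$$

$$lcs(z'_j, x') = \begin{cases} \min\{\overleftarrow{\alpha} + cj, \overleftarrow{\beta}\} & \text{if } \overleftarrow{\alpha} + cj \ne \overleftarrow{\beta}, \\ \overleftarrow{\beta} + lcs(z'_j[1..|z'_j| - \overleftarrow{\beta}], x'[1..|x'| - \overleftarrow{\beta}]) & \text{otherwise,} \end{cases}$$

where $\overrightarrow{\alpha} = \overrightarrow{rep}_c(pz_0) - |p|$, $\overrightarrow{\beta} = \overrightarrow{rep}_c(px) - |p|$, $\overleftarrow{\alpha} = \overleftarrow{rep}_c(z'_0 p) - |p|$ and $\overleftarrow{\beta} = \overleftarrow{rep}_c(x'p) - |p|$.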

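Proof. Since $\overrightarrow{rep}_c(pz_j) = \overrightarrow{\alpha} - cj + |p|$, both $pz_j$ and $px$ have a prefix of length $\min\{\overrightarrow{\alpha} - cj, \overrightarrow{\beta}\} + |p|$ with period $c$ (see also Fig. 3). If $\overrightarrow{\alpha} - cj \ne \overrightarrow{\beta}$, either $pz_j$ or $px$ has a prefix of length $\min\{\overrightarrow{\alpha} - cj, \overrightarrow{\beta}\} + |p| + 1$ with period $c$ while the other does not, and hence $lcp(z_j, x) = lcp(pz_j, px) - |p| = \min\{\overrightarrow{\alpha} - cj, \overrightarrow{\beta}\}$. Only when the two periodicities break at the same position, i.e., $\overrightarrow{\alpha} - cj = \overrightarrow{\beta}$, could $lcp(z_j, x)$ expand. Note that such expansion occurs at most once. Similarly, since $\overleftarrow{rep}_c(z'_j p) = \overleftarrow{\alpha} + cj + |p|$, we get the statement for $lcs(z'_j, x')$. □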
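The lcp half of Lemma 9 can be sanity-checked by brute force on a small example. The sketch below computes the closed form for every $j$ for which the first branch applies and compares it with a naive lcp. The example string and all function names are made up for illustration; positions passed to the functions are 1-based, as in the lemma.

```python
def lcp(a: str, b: str) -> int:
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def rep_fwd(s: str, c: int) -> int:
    """Length of the longest prefix of s having period c."""
    n = min(c, len(s))
    while n < len(s) and s[n] == s[n - c]:
        n += 1
    return n


def predicted_lcps(s, p, a0, c, k, x):
    """min{alpha - c*j, beta} for each j, or None where the 'otherwise' branch applies."""
    z0 = s[a0 - 1 + len(p):]
    alpha = rep_fwd(p + z0, c) - len(p)
    beta = rep_fwd(p + x, c) - len(p)
    return [min(alpha - c * j, beta) if alpha - c * j != beta else None
            for j in range(k + 1)]


def actual_lcps(s, p, a0, c, k, x):
    """Naive lcp(z_j, x) for each occurrence a_j = a0 + c*j."""
    return [lcp(s[a0 + c * j - 1 + len(p):], x) for j in range(k + 1)]
```

With `s = "x" + "abc" * 6 + "ay"`, `p = "abcabc"`, occurrences starting at position 2 with common difference 3, and `x = "abcabx"`, both functions return the same five values, so a single pair of numbers (alpha and beta) describes all five comparisons.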




Fig. 3. Illustration for Lemma 9.

In the next lemma, we show how to handle one of the arithmetic progressions computed in Step 2 of Case (3).

Lemma 10. Let $X_i \to X_\ell X_r$ be a production of an SLP of size $n$ and $p$ be the suffix of $val(X_\ell)$ of length $2^{t-1}$. Let $\{a_0, a_1, \ldots, a_k\}$ be consecutive occurrences of $p'$ in $val(X_i)$ which form a single arithmetic progression, which are computed in Step 2 of Case (3). We can detect all runs corresponding to the occurrences of $p'$ by using LCE a constant number of times. Also, such runs are represented in constant space.

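Proof. We apply Lemma 9 by letting $s = \mathit{val}(X_i)$, $x = \mathit{val}(X_r)$ and $x' = \mathit{val}(X_\ell)[1..|\mathit{val}(X_\ell)| - |p|]$. First we compute $\overrightarrow{\alpha} = \mathit{lcp}(pz_0, p[c+1..|p|]z_0) + c - |p|$, $\overrightarrow{\beta} = \mathit{lcp}(px, p[c+1..|p|]x) + c - |p|$, $\overleftarrow{\alpha} = \mathit{lcs}(z'_0p, z'_0p[1..|p| - c]) + c - |p|$ and $\overleftarrow{\beta} = \mathit{lcs}(x'p, x'p[1..|p| - c]) + c - |p|$ by using $\mathit{lcp}$ and $\mathit{lcs}$ four times.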
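These four expressions rest on the identity that the longest prefix of $w$ with period $c$ has length $\mathit{lcp}(w, w[c+1..|w|]) + c$, and symmetrically for suffixes. The following sketch checks the identity on plain strings; the helper names are ours, and the naive character comparisons stand in for the LCE queries on the SLP.

```python
def lcp(u: str, v: str) -> int:
    """Length of the longest common prefix of u and v."""
    n = 0
    while n < len(u) and n < len(v) and u[n] == v[n]:
        n += 1
    return n


def lcs(u: str, v: str) -> int:
    """Length of the longest common suffix of u and v."""
    return lcp(u[::-1], v[::-1])


def rep_prefix_via_lcp(w: str, c: int) -> int:
    """Longest prefix of w with period c, assuming 1 <= c <= len(w)."""
    return lcp(w, w[c:]) + c          # compare w with itself shifted by c


def rep_suffix_via_lcs(w: str, c: int) -> int:
    """Longest suffix of w with period c, assuming 1 <= c <= len(w)."""
    return lcs(w, w[:len(w) - c]) + c


p, c = "abcabc", 3
beta_right = rep_prefix_via_lcp(p + "abcabdzz", c) - len(p)   # x  = "abcabdzz"
beta_left = rep_suffix_via_lcs("qcabc" + p, c) - len(p)       # x' = "qcabc"
```

With these hypothetical inputs, the periodic region extends 5 characters into `x` and 4 characters back into `x'`.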

Claim. If $\overrightarrow{\beta} + \overleftarrow{\alpha} \ge a_0 - 1 + c$, the root of any repetition detected from $a_j$ is not primitive.

Proof of Claim. If $\overrightarrow{\beta} + \overleftarrow{\alpha} \ge a_0 - 1 + c$, $pyp$ must have period $c$, where $y$ is the prefix of length $a_1 - 1$ of $x$. Since $pyp[c+1..c+|p|] = p$, $|yp| - c$ is a period of $yp$. It follows from the periodicity lemma that $|py|$, as well as every $a_j + |p| - 1$, is divisible by the greatest common divisor of $c$ and $|yp| - c$, and hence the root of any repetition detected from $a_j$ is not primitive. □

From the above claim, in what follows we assume that $\overrightarrow{\beta} + \overleftarrow{\alpha} < a_0 - 1 + c$. Let $d_j = a_j - 1 + |p| = a_0 - 1 + |p| + cj$. We then want to check whether $\mathit{lcp}(z_j, x) + \mathit{lcs}(z'_j, x') \ge d_j - |p| = a_0 - 1 + cj$, or equivalently, $\mathit{lcp}(z_j, x) + \mathit{lcs}(z'_j, x') - cj \ge a_0 - 1$.

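Let $j' = \min\{j \ge 0 \mid \overrightarrow{\alpha} - cj \le \overrightarrow{\beta}\}$ and $j'' = \min\{j \ge 0 \mid \overleftarrow{\alpha} + cj \ge \overleftarrow{\beta}\}$. For any $0 \le j < \min\{j', j''\}$, it follows from $\mathit{lcp}(z_j, x) = \overrightarrow{\beta}$ and $\mathit{lcs}(z'_j, x') = \overleftarrow{\alpha} + cj$ that $\mathit{lcp}(z_j, x) + \mathit{lcs}(z'_j, x') - cj = \overrightarrow{\beta} + \overleftarrow{\alpha}$, and hence a repetition $(\delta_1 - cj, \delta_2 + cj, \delta_3 + cj)$ appears iff $\overrightarrow{\beta} + \overleftarrow{\alpha} \ge a_0 - 1$, where $\delta_1 = |x'| + 1 - \overleftarrow{\alpha}$, $\delta_2 = a_0 + |p| + \overrightarrow{\beta} - 1$ and $\delta_3 = a_0 + |p| - 1$ are constants.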
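Both thresholds have closed forms, which is why they cost only constant time once the four values are known. A minimal sketch, with hypothetical parameter names standing for $\overrightarrow{\alpha}$, $\overrightarrow{\beta}$, $\overleftarrow{\alpha}$, $\overleftarrow{\beta}$:

```python
def threshold_indices(alpha_r: int, beta_r: int,
                      alpha_l: int, beta_l: int, c: int) -> tuple[int, int]:
    """Return (j', j'') for common difference c >= 1.

    j'  = min{ j >= 0 : alpha_r - c*j <= beta_r }
    j'' = min{ j >= 0 : alpha_l + c*j >= beta_l }
    """
    def ceil_div(a: int, b: int) -> int:
        return -(-a // b)                 # integer ceiling of a / b for b > 0

    j1 = max(0, ceil_div(alpha_r - beta_r, c))
    j2 = max(0, ceil_div(beta_l - alpha_l, c))
    return j1, j2
```

For instance, `threshold_indices(11, 5, 1, 8, 3)` gives `(2, 3)`, so in that hypothetical setting the uniform range is $0 \le j < 2$.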

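We show that the root of such a repetition $(\delta_1 - cj, \delta_2 + cj, \delta_3 + cj)$ is primitive. Assume on the contrary that it is not primitive, namely, $s' = s[\delta_1 - cj..\delta_2 + cj] = u^q$ with $|u| \le (\delta_3 + cj)/2$ and $q \ge 4$. Evidently, $\overrightarrow{\mathit{rep}}_c(s') = \overrightarrow{\beta} + \mathit{lcs}(z'_j, x') + |p| = \overrightarrow{\beta} + \overleftarrow{\alpha} + |p| + cj$. It follows from $a_0 - 1 \le \overrightarrow{\beta} + \overleftarrow{\alpha} < a_0 - 1 + c$ that $\delta_3 + cj \le \overrightarrow{\mathit{rep}}_c(s') < \delta_3 + cj + c < |s'|$. Since $2|u| \le \overrightarrow{\mathit{rep}}_c(s')$ and $c < |p| \le (\delta_3 + cj)/2 \le \delta_3 + cj - |u| \le \overrightarrow{\mathit{rep}}_c(s') - |u|$, we have $\overrightarrow{\mathit{rep}}_c(s'[1..|s'| - |u|]) = \overrightarrow{\mathit{rep}}_c(s'[|u| + 1..|s'|]) + |u|$. However, both $s'[1..|s'| - |u|]$ and $s'[|u| + 1..|s'|]$ are $u^{q-1}$, a contradiction. Therefore, for all $0 \le j < \min\{j', j''\}$, $(\delta_1 - cj, \delta_2 + cj, \delta_3 + cj)$ are runs, and they can be encoded by a quintuplet $(\delta_1, \delta_2, \delta_3, c, \min\{j', j''\})$.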
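The quintuplet is a constant-space encoding of a whole family of runs. The sketch below shows how one could decode it; the `Run` type and the function name are illustrative assumptions, and each triple is read as (begin, end, period) following the notation above.

```python
from typing import Iterator, NamedTuple


class Run(NamedTuple):
    begin: int
    end: int
    period: int


def expand_quintuplet(d1: int, d2: int, d3: int, c: int, m: int) -> Iterator[Run]:
    """Yield the runs (d1 - c*j, d2 + c*j, d3 + c*j) for 0 <= j < m."""
    for j in range(m):
        yield Run(d1 - c * j, d2 + c * j, d3 + c * j)
```

Decoding takes time proportional to the number of runs encoded, while the stored representation stays at five integers regardless of that number.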

For any $\min\{j', j''\} < j \le k$ except for $j = j'$ or $j''$, $\mathit{lcp}(z_j, x) + \mathit{lcs}(z'_j, x') - cj$ is monotonically decreasing by at least $c$ and satisfies $\mathit{lcp}(z_j, x) + \mathit{lcs}(z'_j, x') - cj \le \overrightarrow{\beta} + \overleftarrow{\alpha} - c < a_0 - 1$, and hence, no repetition appears. For $j'$ and $j''$, we can check whether these two occurrences become runs or not by using LCE a constant number of times. □
Fig. 4. Illustration for Lemma 10. Four runs are found. Here $j' = 3$ and $j'' = 2$. The runs from $p_0$ and $p_1$ are encoded by a quintuplet. For each of $j'$ and $j''$, the run is separately encoded by a quintuplet that shows a single run.

The other cases can be processed in a similar way. 

A minor technicality is that we may redundantly find the same run in different cases. However, we can avoid duplicates by simply looking into the currently computed runs when we add new runs, spending $O(\log N)$ time. Also, we can remove repetitions whose roots are not primitive by just choosing the smallest period among the repetitions with the same interval.
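The clean-up step can be sketched as follows. This is only an illustration on explicitly listed repetitions: the function name is ours, and a hash map replaces the ordered lookup that the $O(\log N)$ bound suggests.

```python
def keep_smallest_period(reps: list[tuple[int, int, int]]) -> list[tuple[int, int, int]]:
    """Deduplicate (begin, end, period) triples, keeping for every interval
    only the smallest period, i.e. the one with a primitive root."""
    best: dict[tuple[int, int], int] = {}
    for begin, end, period in reps:
        key = (begin, end)
        if key not in best or period < best[key]:
            best[key] = period
    return sorted((b, e, p) for (b, e), p in best.items())
```

For example, the interval `(1, 8)` reported with periods 4 and 2 survives only with period 2, and an exact duplicate is dropped.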
Appendix B: Proof of Lemma 7

Proof. The outline of our algorithm to compute FirstMismatch follows [13], which used a slower algorithm for Match. Assume $|X_i| - k + 1 \le |X_j|$ holds. If $X_j \to a$ with $a \in \Sigma$, then

$$
\mathit{FirstMismatch}(X_i, X_j, k) =
\begin{cases}
1 & \text{if } \mathit{Match}(X_i, X_j, k) = \mathit{true}, \\
0 & \text{if } \mathit{Match}(X_i, X_j, k) = \mathit{false}.
\end{cases}
$$

If $X_j \to X_{\ell(j)}X_{r(j)}$, then we can recursively compute $\mathit{FirstMismatch}(X_i, X_j, k)$ as follows:

$$
\mathit{FirstMismatch}(X_i, X_j, k) =
\begin{cases}
\mathit{FirstMismatch}(X_i, X_{r(j)}, k + |X_{\ell(j)}|) & \text{if } \mathit{Match}(X_i, X_{\ell(j)}, k) = \mathit{true}, \\
\mathit{FirstMismatch}(X_i, X_{\ell(j)}, k) & \text{if } \mathit{Match}(X_i, X_{\ell(j)}, k) = \mathit{false}.
\end{cases}
$$

We apply the lemma on Match queries to $\mathcal{S}$, pre-processing SLP $\mathcal{S}$ in $O(n^2h)$ time and $O(n^2)$ space, so that a query $\mathit{Match}(X_i, X_{j'}, k')$ is answered in $O(\log N)$ time for any variable $X_{j'}$ and integer $k'$. Note that in either case of the recurrence above the height of the second variable decreases by 1. Hence we can compute $\mathit{FirstMismatch}(X_i, X_j, k)$ in $O(h \log N)$ time, after the $O(n^2h)$-time $O(n^2)$-space pre-processing. □
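The descent can be tried out on the example grammar of Fig. 5. The sketch below makes several assumptions of its own: `match` expands the strings naively instead of using the $O(\log N)$-time query, the grammar is stored in a hypothetical dictionary `SLP`, and `first_mismatch` returns the number of matching characters, so it adds $|X_{\ell(j)}|$ whenever it descends to the right child (an accumulation that the recurrence above does not spell out).

```python
from functools import lru_cache

# Variable index -> terminal character or (left child, right child).
SLP = {1: "a", 2: "b", 3: (2, 2), 4: (1, 2), 5: (1, 3),
       6: (4, 3), 7: (5, 6), 8: (7, 7)}


@lru_cache(maxsize=None)
def val(v: int) -> str:
    """String derived by variable v (naive expansion, for illustration only)."""
    rule = SLP[v]
    return rule if isinstance(rule, str) else val(rule[0]) + val(rule[1])


def match(i: int, j: int, k: int) -> bool:
    """True iff val(X_j) occurs in val(X_i) at 1-based position k."""
    w = val(j)
    return val(i)[k - 1:k - 1 + len(w)] == w


def first_mismatch(i: int, j: int, k: int) -> int:
    """Number of leading characters of val(X_j) matching val(X_i)[k..]."""
    rule = SLP[j]
    if isinstance(rule, str):
        return 1 if match(i, j, k) else 0
    left, right = rule
    if match(i, left, k):
        # The whole left child matches: skip it and continue in the right child.
        return len(val(left)) + first_mismatch(i, right, k + len(val(left)))
    # The mismatch lies inside the left child.
    return first_mismatch(i, left, k)
```

Each recursive call moves one level down the derivation tree of the second variable and issues one `match` query, which mirrors the $O(h \log N)$ bound when `match` is replaced by the preprocessed query.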
Appendix C: Figures




Fig. 5. The derivation tree of SLP $\mathcal{S} = \{X_1 \to a, X_2 \to b, X_3 \to X_2X_2, X_4 \to X_1X_2, X_5 \to X_1X_3, X_6 \to X_4X_3, X_7 \to X_5X_6, X_8 \to X_7X_7\}$, representing string $s = \mathtt{abbabbbabbabbb}$.
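To relate the example to the parameters $n$, $N$ and $h$ used in the complexity bounds, the following sketch computes the derived length and the height of every variable bottom-up without expanding the string. The dictionary `RULES` is our transcription of the grammar in Fig. 5, and measuring height in edges (terminal productions have height 0) is an assumption of this sketch.

```python
# Variable index -> terminal character or (left child, right child).
RULES = {1: "a", 2: "b", 3: (2, 2), 4: (1, 2), 5: (1, 3),
         6: (4, 3), 7: (5, 6), 8: (7, 7)}


def lengths_and_heights(rules: dict) -> tuple[dict, dict]:
    """Return (|X_v|, height of X_v) for every variable v.

    Assumes every right-hand side refers only to smaller indices, which
    holds for RULES, so one pass in increasing order suffices.
    """
    length, height = {}, {}
    for v in sorted(rules):
        rule = rules[v]
        if isinstance(rule, str):
            length[v], height[v] = 1, 0
        else:
            left, right = rule
            length[v] = length[left] + length[right]
            height[v] = 1 + max(height[left], height[right])
    return length, height


length, height = lengths_and_heights(RULES)
```

Under these conventions the example has $n = 8$ variables, $N = |X_8| = 14$ and $h = 4$.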




Fig. 6. Lemma 5: Illustration for computing $\mathit{LCE}(X_i, k_1, k_2)$. The roots of the gray subtrees are labeled by the variables in $U$. We find the first variable $X_{u(p')}$ in the list $U$ with which the Match query returns false. We then perform the FirstMismatch query for $X_i$ and $X_{u(p')}$ using the appropriate offset.
Fig. 7. Lemma 8: Illustration for approximate doubling. The prefix variables up to $Y_{a_5}$ have been created. The traversals for $v_2$, $v_3$, $v_4$ end due to Condition 1 and that for $v_5$ ends due to Condition 2. Each traversed edge (depicted in bold) contributes to at most one new variable for some segment. Next, we will resume the traversal from $v_5$ targeting position $2|Y_{a_5}|$, and iterate the procedure until we get the last variable. The total number of bold edges can be bounded by $O(n)$ thanks to Condition 2.